In-class_Ex5: Modeling the Spatial Variation of the Explanatory Factors of Water Point Status using Geographically Weighted Logistic Regression

1.1 Setting the scene

  • To build an explanatory model to discover the factors affecting water point status in Osun State, Nigeria

  • Study area: Osun State, Nigeria

  • Data sets:

    • Osun.rds, which contains the LGA boundaries of Osun State as an sf polygon data frame.

    • Osun_wp_sf.rds, which contains the water points within Osun State as an sf point data frame.

1.1.1 Model Variables

  • Dependent variable: Water point status (i.e. functional/non-functional)

  • Independent variables:

    • distance_to_primary_road

    • distance_to_secondary_road

    • distance_to_tertiary_road

    • distance_to_city

    • distance_to_town

    • water_point_population

    • local_population_1km

    • usage_capacity

    • is_urban

    • water_source_clean

    • The last three variables (usage_capacity, is_urban and water_source_clean) are categorical.

1.1.2 Getting Started

The R packages required for this exercise are as follows:

  • Spatial data handling

    • sf and spdep
  • Attribute data handling

    • tidyverse
  • Choropleth mapping

    • tmap
  • Multivariate data visualisation and analysis

    • corrplot and ggpubr
  • Exploratory Data Analysis

    • funModeling
    • skimr
  • Regression Modelling

    • GWmodel - geographically weighted regression

    • caret - Classification And REgression Training

    • blorr - binary logistic regression model

The following code chunk is used to load the necessary R packages:

pacman::p_load(sf, spdep, tmap, tidyverse, funModeling, blorr, corrplot, ggpubr, GWmodel, skimr, caret)

1.1.3 Data Import

The code chunk below uses the read_rds() function of the readr package (part of the tidyverse) to import the rds files.

Osun is imported into R as a polygon feature data frame.

Osun_wp_sf is imported into R as a point feature data frame.

These files have been cleaned and prepared.

The raw data for Osun_wp_sf can be obtained from the WPdx Global Data Repositories. There are two versions of the data: WPdx-Basic and WPdx+. We are required to use the WPdx+ data set.

The raw data for Osun is the Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data. The data can be downloaded from either The Humanitarian Data Exchange portal or geoBoundaries. We are required to use the “nga_polnda_adm2_1m_salb” data set.

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")

1.1.4 Quick Exploratory Data Analysis

The following code chunk performs a quick exploratory analysis of the status column of the Osun_wp_sf data frame. TRUE indicates that the water point is functional and FALSE indicates that it is non-functional. We can see that 2118 water points, or 44.5% of Osun’s water points, are non-functional. This is a worrying proportion, and the reasons behind it are worth investigating.

Osun_wp_sf %>%
    freq(input = 'status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0
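
The percentages reported by freq() can be checked by hand. The code chunk below is a minimal base R sketch that recomputes them from the two counts shown in the output above; the vector names are chosen here for illustration only.

```r
# Counts taken from the freq() output above
status_counts <- c(functional = 2642, non_functional = 2118)

# Share of each class, as a percentage rounded to one decimal place
status_pct <- round(100 * prop.table(status_counts), 1)
print(status_pct)
```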

The following code chunk produces an interactive point map showing the locations of functional and non-functional water points in Osun State, Nigeria.

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun) +
    tm_polygons(alpha = 0.4)+
    tm_shape(Osun_wp_sf) +
    tm_dots(col = 'status',
            alpha = 0.6)+
    tm_view(set.zoom.limits = c(9,12))

1.1.5 Summary Statistics

Summary statistics are obtained with the skim() function of the skimr package in the code chunk below. The purpose is to see at a glance how much data is missing in each field and to decide which independent variables to exclude from the initial regression model.

For example, fecal_coliform_value has 4760 missing values, install_year has 1144, rehab_priority has 2654, crucialness_score and pressure_score have 798 each, and rehab_year has 4760. These variables are excluded from the initial regression model.

Osun_wp_sf %>%
    skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

1.1.6 Selection of independent Variables

Osun_wp_sf_clean is created by filtering on the model fields using the code chunk below. all_vars(!is.na(.)) keeps only the rows with no missing value in any of the listed fields, which removes the 4 observations with missing values in water_point_population and local_population_1km. Both variables are kept in the initial regression model because losing 4 of 4760 observations is negligible. mutate(usage_capacity = as.factor(usage_capacity)) ensures that R treats usage_capacity as a categorical variable instead of a continuous one.

Osun_wp_sf_clean <- Osun_wp_sf %>%
    filter_at(vars(status,
                   distance_to_primary_road,
                   distance_to_secondary_road,
                   distance_to_tertiary_road,
                   distance_to_city,
                   distance_to_town,
                   water_point_population,
                   local_population_1km,
                   usage_capacity,
                   is_urban,
                   water_source_clean),
              all_vars(!is.na(.))) %>%
    mutate(usage_capacity = as.factor(usage_capacity))
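
To make the effect of the two steps concrete, the code chunk below is a minimal base R sketch on a toy data frame with made-up values. It uses complete.cases() as a stand-in for the filter_at() / all_vars() step; this is an illustrative equivalent, not the code used in this exercise.

```r
# Toy data frame (hypothetical values) with one missing population figure
toy_wp <- data.frame(
  status = c(TRUE, FALSE, TRUE, FALSE),
  water_point_population = c(120, NA, 45, 300),
  usage_capacity = c(300, 1000, 300, 1000)
)

# Keep only rows with no NA in the model variables
model_vars <- c("status", "water_point_population", "usage_capacity")
toy_clean <- toy_wp[complete.cases(toy_wp[, model_vars]), ]

# Treat usage_capacity as categorical rather than continuous
toy_clean$usage_capacity <- as.factor(toy_clean$usage_capacity)

nrow(toy_clean)
levels(toy_clean$usage_capacity)
```

The row containing the NA is dropped, and usage_capacity becomes a factor with one level per distinct capacity value.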

The code chunk below selects the variables listed above for the initial regression model and drops the geometry column so that we can use corrplot.mixed().

Osun_wp <- Osun_wp_sf_clean %>%
    select(c(7,35:39,42:43, 46:47,57)) %>% # you can create a list and point this to the list instead for more elegance.
    st_set_geometry(NULL)
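
Selecting columns by position is fragile if the column order ever changes. As the comment above suggests, the column names can be kept in a vector instead. The code chunk below is a minimal base R sketch on a toy data frame with hypothetical values; in a dplyr pipeline the analogous call would be select(all_of(model_vars)).

```r
# Toy data frame standing in for Osun_wp_sf_clean (hypothetical values)
toy_wp <- data.frame(
  wpdx_id = c("a", "b", "c"),
  status = c(TRUE, FALSE, TRUE),
  distance_to_city = c(1500.2, 820.7, 12000.4),
  notes = c("x", "y", "z")
)

# Name the columns once, then reuse the vector wherever they are needed
model_vars <- c("status", "distance_to_city")
toy_model <- toy_wp[, model_vars]

names(toy_model)
```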

In the code chunk below, the corrplot.mixed() function of the corrplot package is used to visualise the correlations among the independent variables. The correlation plot shows that no pair of independent variables is highly correlated (correlation coefficient >= 0.85). Hence, no independent variable is removed due to multicollinearity.

cluster_vars.cor = cor(Osun_wp[,(2:7)])
corrplot.mixed(cluster_vars.cor,
               lower = 'ellipse',
               upper = 'number',
               tl.pos = 'lt',
               diag = 'l',
               tl.col = 'black')
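
The same 0.85 threshold can also be checked programmatically instead of by reading the plot. The code chunk below is a minimal base R sketch on simulated data (x1, x2 and x3 are made-up variables, with x2 deliberately built to correlate with x1); it lists every pair whose absolute correlation reaches the threshold.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # built to be strongly correlated with x1
x3 <- rnorm(n)                  # independent of the others
toy <- data.frame(x1, x2, x3)

cor_mat <- cor(toy)

# Look only at the upper triangle so each pair is reported once
high <- which(abs(cor_mat) >= 0.85 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
```

An empty result would correspond to the situation described above, where no variable needs to be removed.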

2.1 Building Initial Logistic Regression Model

The following code chunk uses glm() to build our initial logistic regression model.

model <- glm(status ~ distance_to_primary_road +
                 distance_to_secondary_road +
                 distance_to_tertiary_road +
                 distance_to_city +
                 distance_to_town +
                 is_urban +
                 usage_capacity +
                 water_source_clean +
                 water_point_population +
                 local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))

Instead of the typical R report, blr_regress() is used to generate a comprehensive regression output. From this output, we identify the independent variables that are not statistically significant (p-value > 0.05) at the 95% confidence level: distance_to_primary_road (p-value 0.4744) and distance_to_secondary_road (p-value 0.5802).

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------
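
The screening for insignificant variables can also be done in code rather than by reading the table. The code chunk below is a minimal sketch on simulated data, where signal and noise are made-up predictors; it pulls the Wald-test p-values out of the summary of a glm() fit and lists the predictors above the 0.05 threshold.

```r
set.seed(42)
n <- 500
signal <- rnorm(n)   # predictor that drives the outcome
noise  <- rnorm(n)   # predictor unrelated to the outcome
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * signal))

toy_model <- glm(y ~ signal + noise, family = binomial(link = "logit"))

# Wald-test p-values from the coefficient table
p_values <- summary(toy_model)$coefficients[, "Pr(>|z|)"]

# Predictors (intercept excluded) that are not significant at the 5% level
not_significant <- names(p_values)[-1][p_values[-1] > 0.05]
not_significant
```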

2.2 Building Revised Logistic Regression Model

After identifying the insignificant variables, we build a revised logistic regression model that excludes the two variables identified in section 2.1.

model_r <- glm(status ~ distance_to_tertiary_road +
                 distance_to_city +
                 distance_to_town +
                 is_urban +
                 usage_capacity +
                 water_source_clean +
                 water_point_population +
                 local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = 'logit'))

blr_regress() is used again to confirm that the revised logistic regression model has no insignificant independent variables.

blr_regress(model_r)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------
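
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The code chunk below is a minimal sketch that does this for the categorical terms, using estimates copied from the output above; the vector names are chosen here for illustration.

```r
# Estimates copied from the blr_regress(model_r) output above
estimates <- c(
  is_urbanTRUE = -0.2667,
  usage_capacity1000 = -0.6206,
  "water_source_cleanProtected Shallow Well" = 0.4947,
  "water_source_cleanProtected Spring" = 1.2790
)

# exp(beta) is the multiplicative change in the odds of being functional
odds_ratios <- round(exp(estimates), 3)
odds_ratios
```

For instance, the odds ratio for usage_capacity1000 is about 0.54, meaning the odds of a water point with a capacity of 1000 being functional are roughly 46% lower than for one with a capacity of 300, holding the other variables constant.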

2.2.1 Non-geographically weighted confusion matrix

The code chunk below creates the confusion matrix of the non-geographically weighted model with a cutoff of 0.5.

blr_confusion_matrix(model_r, cutoff = 0.5) # non-geography weighted
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1
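
The headline statistics follow directly from the four cell counts. The code chunk below is a minimal base R sketch that recomputes accuracy, sensitivity and specificity from the counts in the matrix above, with 1 (functional) as the positive class; the variable names are chosen here for illustration.

```r
# Counts copied from the blr_confusion_matrix() output above
tp <- 1899  # predicted 1, observed TRUE
tn <- 1300  # predicted 0, observed FALSE
fp <- 814   # predicted 1, observed FALSE
fn <- 743   # predicted 0, observed TRUE

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```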

2.2.2 Conversion from simple features (sf) to SpatialPointsDataFrame (sp)

The code chunk below selects the necessary fields and converts the result to a SpatialPointsDataFrame object with as_Spatial().

This is necessary because bw.ggwr() requires a SpatialPointsDataFrame object as input.

Osun_wp_sp_r <- Osun_wp_sf_clean %>% 
    select(c(status,
             distance_to_tertiary_road,
             distance_to_city,
             distance_to_town,
             water_point_population,
             local_population_1km,
             is_urban,
             usage_capacity,
             water_source_clean)) %>%
    as_Spatial()
Osun_wp_sp_r
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 9
names       : status, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

2.2.3 Finding fixed bandwidth

The code chunk below uses bw.ggwr() to determine the fixed bandwidth needed to calibrate a generalized Geographically Weighted Regression (GWR) model.

bw.fixed_r <- bw.ggwr(status ~
                        distance_to_tertiary_road +
                        distance_to_city +
                        distance_to_town +
                        water_point_population +
                        local_population_1km +
                        usage_capacity +
                        is_urban +
                        water_source_clean,
                    data = Osun_wp_sp_r,
                    family = "binomial",
                    approach = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE,
                    longlat = FALSE)

The code chunk below prints the generated fixed bandwidth.

bw.fixed_r
bw.fixed_r <- 2377.371 # bw.ggwr() takes a long time to run, so the bandwidth it returned is hard-coded here in the interest of time
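
To get a feel for what a fixed bandwidth of 2377.371 means, the code chunk below is a minimal base R sketch of how a Gaussian kernel turns distance into a weight. It assumes the usual Gaussian form exp(-0.5 * (d / bw)^2), and the distances are made-up values for illustration.

```r
bw <- 2377.371                      # fixed bandwidth reported above, in metres
d  <- c(0, 1000, bw, 5000, 10000)   # illustrative distances, in metres

# Gaussian kernel: weight decays smoothly with distance from the regression point
w <- exp(-0.5 * (d / bw)^2)
round(w, 3)
```

Under this assumption, a water point exactly one bandwidth away receives a weight of exp(-0.5), about 0.61, and points further away contribute progressively less to the local regression.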

2.2.4 Implement Generalized Geographically Weighted Regression (GWR)

The code chunk below uses ggwr.basic() to implement generalized GWR. Notice that bw.fixed_r is included as input (bw = bw.fixed_r) into the function.

gwlr.fixed_r <- ggwr.basic(status ~
                  distance_to_tertiary_road +
                        distance_to_city +
                        distance_to_town +
                        water_point_population +
                        local_population_1km +
                        usage_capacity +
                        is_urban +
                        water_source_clean,
                    data = Osun_wp_sp_r,
                    bw = bw.fixed_r,
                    family = "binomial",
                    kernel = "gaussian",
                    adaptive = FALSE,
                    longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 

The code chunk below prints the calibrated model. The output reports the global generalized linear regression first, followed by the geographically weighted regression. Their AICc values are as follows:

  • Generalized linear Regression’s AICc: 5708.923

  • Geographically Weighted Regression’s AICc: 4744.213

Since a lower AICc is preferred, we conclude that the geographically weighted regression fits the data better.

gwlr.fixed_r #top is GLM, bottom is geographical version #AICc 
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 23:51:14 
   Call:
   ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city + 
    distance_to_town + water_point_population + local_population_1km + 
    usage_capacity + is_urban + water_source_clean, data = Osun_wp_sp_r, 
    bw = bw.fixed_r, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km usage_capacity is_urban water_source_clean
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-129.368    -1.750     1.074     1.742    34.126  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.540e-01  1.055e-01   3.354 0.000796
distance_to_tertiary_road                 1.001e-04  2.040e-05   4.910 9.13e-07
distance_to_city                         -1.764e-05  3.391e-06  -5.202 1.97e-07
distance_to_town                         -1.544e-05  2.825e-06  -5.466 4.60e-08
water_point_population                   -5.098e-04  4.476e-05 -11.390  < 2e-16
local_population_1km                      3.452e-04  1.779e-05  19.407  < 2e-16
usage_capacity1000                       -6.206e-01  6.966e-02  -8.908  < 2e-16
is_urbanTRUE                             -2.667e-01  7.474e-02  -3.569 0.000358
water_source_cleanProtected Shallow Well  4.947e-01  8.496e-02   5.823 5.79e-09
water_source_cleanProtected Spring        1.279e+00  4.384e-01   2.917 0.003530
                                            
Intercept                                ***
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
water_point_population                   ***
local_population_1km                     ***
usage_capacity1000                       ***
is_urbanTRUE                             ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.9  on 4746  degrees of freedom
AIC: 5708.9

Number of Fisher Scoring iterations: 5


 AICc:  5708.923
 Pseudo R-square value:  0.129406
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2377.371 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -3.7021e+02 -4.3797e+00  3.5590e+00
   distance_to_tertiary_road                -3.1622e-02 -4.5462e-04  9.1291e-05
   distance_to_city                         -5.4555e-02 -6.5623e-04 -1.3507e-04
   distance_to_town                         -8.6549e-03 -5.2754e-04 -1.6785e-04
   water_point_population                   -2.9696e-02 -2.2705e-03 -1.2277e-03
   local_population_1km                     -7.7730e-02  4.4281e-04  1.0548e-03
   usage_capacity1000                       -5.5889e+01 -1.0347e+00 -4.1960e-01
   is_urbanTRUE                             -7.3554e+02 -3.4675e+00 -1.6596e+00
   water_source_cleanProtected.Shallow.Well -1.8842e+02 -4.7295e-01  6.2378e-01
   water_source_cleanProtected.Spring       -1.3630e+03 -5.3436e+00  2.7714e+00
                                                3rd Qu.      Max.
   Intercept                                 1.3755e+01 2171.6375
   distance_to_tertiary_road                 6.3011e-04    0.0237
   distance_to_city                          1.5921e-04    0.0162
   distance_to_town                          2.4490e-04    0.0179
   water_point_population                    4.5879e-04    0.0765
   local_population_1km                      1.8479e-03    0.0333
   usage_capacity1000                        3.9113e-01    9.2449
   is_urbanTRUE                              1.0554e+00  995.1841
   water_source_cleanProtected.Shallow.Well  1.9564e+00   66.8914
   water_source_cleanProtected.Spring        7.0805e+00  208.3749
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2815.659 
   AIC : 4418.776 
   AICc : 4744.213 
   Pseudo R-square value:  0.5691072 

   ***********************************************************************
   Program stops at: 2022-12-17 23:51:51 
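
The fit statistics in the output above can be related to one another with a little arithmetic. The code chunk below is a minimal sketch that assumes the pseudo R-square is the deviance-based measure 1 - (model deviance / null deviance); this assumption reproduces the reported values to about four decimal places when the rounded deviances from the output are used.

```r
null_deviance <- 6534.5     # from the GLM output above
glm_deviance  <- 5688.9     # residual deviance of the global model
gw_deviance   <- 2815.659   # GW deviance of the local model

# Deviance-based pseudo R-square: 1 - model deviance / null deviance
pseudo_r2_glm <- 1 - glm_deviance / null_deviance
pseudo_r2_gw  <- 1 - gw_deviance / null_deviance

# Lower AICc is preferred; the difference quantifies the improvement
aicc_glm <- 5708.923
aicc_gw  <- 4744.213
aicc_drop <- aicc_glm - aicc_gw

round(c(glm = pseudo_r2_glm, gw = pseudo_r2_gw), 4)
aicc_drop
```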

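The code chunk below converts the SDF object of gwlr.fixed_r into a data frame and assigns it to gwr.fixed_r. The SDF object holds the model output at each water point, including the observed response (y), the fitted probability (yhat) and the local coefficient estimates.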

gwr.fixed_r <- as.data.frame(gwlr.fixed_r$SDF)

The code chunk below classifies each fitted probability (yhat): values of 0.5 or more become TRUE and all others become FALSE. The result is stored in the ‘most’ field.

gwr.fixed_r <- gwr.fixed_r %>%
    mutate(most = yhat >= 0.5)

The code chunk below converts the fields ‘y’ and ‘most’ to factors, as required by confusionMatrix().

gwr.fixed_r$y <- as.factor(gwr.fixed_r$y)
gwr.fixed_r$most <- as.factor(gwr.fixed_r$most)
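
The thresholding and factor conversion can be seen end to end on a small example. The code chunk below is a minimal base R sketch with made-up probabilities and outcomes; it uses table() in place of confusionMatrix() so that it runs without caret.

```r
# Hypothetical fitted probabilities and observed outcomes
yhat <- c(0.91, 0.35, 0.50, 0.12, 0.77, 0.49)
y    <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# A fitted probability of 0.5 or more is classified as functional (TRUE)
most <- yhat >= 0.5

# Convert both to factors with the same levels, then cross-tabulate
y_f    <- factor(y,    levels = c(FALSE, TRUE))
most_f <- factor(most, levels = c(FALSE, TRUE))
cm <- table(Prediction = most_f, Reference = y_f)
cm
```

Giving both factors the same explicit levels keeps the table square even if one class never appears among the predictions.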

2.2.4 Geographically Weighted Confusion Matrix

CM_r <- confusionMatrix(data = gwr.fixed_r$most, reference = gwr.fixed_r$y)
CM_r# geography weighted
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7661          
                                          
 Mcnemar's Test P-Value : 0.6085          
                                          
            Sensitivity : 0.8671          
            Specificity : 0.8986          
         Pos Pred Value : 0.8724          
         Neg Pred Value : 0.8942          
             Prevalence : 0.4445          
         Detection Rate : 0.3854          
   Detection Prevalence : 0.4418          
      Balanced Accuracy : 0.8828          
                                          
       'Positive' Class : FALSE           
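
Note that the two confusion matrices in this exercise use different positive classes: blr_confusion_matrix() reports 1 (functional), while confusionMatrix() reports FALSE (non-functional). The code chunk below is a minimal base R sketch that recomputes the metrics of both models with FALSE as the positive class so they can be compared like for like; the counts are copied from the two outputs, and the helper metrics() is a hypothetical function written for this illustration.

```r
# Rows are predictions, columns are observed status (FALSE, TRUE),
# with counts copied from the two confusion matrices in this exercise
global_cm <- matrix(c(1300, 814, 743, 1899), nrow = 2,
                    dimnames = list(pred = c("FALSE", "TRUE"),
                                    obs  = c("FALSE", "TRUE")))
local_cm  <- matrix(c(1833, 281, 268, 2374), nrow = 2,
                    dimnames = list(pred = c("FALSE", "TRUE"),
                                    obs  = c("FALSE", "TRUE")))

# Metrics with non-functional (FALSE) as the positive class
metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["FALSE", "FALSE"] / sum(cm[, "FALSE"]),
    specificity = cm["TRUE", "TRUE"] / sum(cm[, "TRUE"]))
}

comparison <- round(rbind(global = metrics(global_cm),
                          local  = metrics(local_cm)), 4)
comparison
```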
                                          
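The geographically weighted model performs as follows. Note that caret treats FALSE (non-functional) as the positive class here:

Accuracy : 0.8846

Sensitivity (proportion of non-functional water points correctly identified) : 0.8671

Specificity (proportion of functional water points correctly identified) : 0.8986

Accuracy rises from 0.6726 for the non-geographically weighted model to 0.8846. This suggests that the factors behind water point failure vary across space, so a localized strategy is needed to identify the specific reasons why water points turn non-functional.

The code chunk below selects the administrative fields and status from Osun_wp_sf_clean and binds them to gwr.fixed_r so that the model output can be mapped.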

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
    select(c(ADM2_EN, ADM2_PCODE,
             ADM1_EN, ADM1_PCODE,
             status))
gwr_sf.fixed_r <- cbind(Osun_wp_sf_selected, gwr.fixed_r)

2.2.5 Choropleth Mapping of Model Probability

The code chunk below maps the water points, colored by their fitted probability of being functional (yhat): darker dots indicate a higher probability and lighter dots a lower one.

tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
    tm_polygons(alpha = 0.1) +
    tm_shape(gwr_sf.fixed_r) +
    tm_dots(col = 'yhat',
            border.col = 'gray60',
            border.lwd = 1) +
    tm_view(set.zoom.limits = c(8,14))
prob_T